An Analysis of the Game of Chess

Queens and Openings

Contributors: Samantha Taskale, Ming Nelson, Nathan Blanken, Ben Drucker

Our Final Tutorial


Motivation

The four researchers have known each other for a few years now. During that time, Ming and Ben have always clashed over who is the better chess player. The two have similar skill levels, so there is a constant back-and-forth over who has the better strategies, and both spend a lot of time figuring out how to improve their game to beat the other. When this project came along, it was a perfect fit: by analyzing chess games, Ming and Ben can work on their play, and you, the reader, can apply our findings to your own game as well. The following report delves into various aspects of the game of chess and works through all of the steps of the data science lifecycle.

If you are not familiar with the game of chess, here is a resource with steps on how to setup a game and basic rules: https://www.chess.com/learn-how-to-play-chess


Data Collection

Here we are pulling all of our data from https://www.kaggle.com/datasets/datasnaek/chess?resource=download. This is a set of just over 20,000 games collected from a selection of users on the site Lichess.org. This set contains:

Game ID;
Rated (T/F);
Start Time;
End Time;
Number of Turns;
Game Status;
Winner;
Time Increment;
White Player ID;
White Player Rating;
Black Player ID;
Black Player Rating;
All Moves in Standard Chess Notation;
Opening Eco (Standardised Code for any given opening, list here);
Opening Name;
Opening Ply (Number of moves in the opening phase)

Each of these fields is recorded for every game in the set. The owner of the dataset collected the data using the Lichess API, which enables collection of any given user's game history.

In the code block below we pull out only the games that were "rated". Rated games are games that have an impact on a player's skill rating. The greater the rating level a player has, the greater their skill should be. When a rated game is won a player's rating increases and when a rated game is lost a player's rating decreases. We decided to pull out only rated games in order to better analyze the different openings which we discuss below. It is believed that during rated games players will try harder and play more seriously because their rating will be impacted with a loss.
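As a rough sketch of that filter (assuming the CSV has been loaded into a pandas DataFrame with a boolean `rated` column, as in the Kaggle file; the tiny DataFrame here is just a stand-in for the real data):

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; the real data is loaded with
# pd.read_csv("games.csv") and has the same column names.
games = pd.DataFrame({
    "id":     ["a1", "b2", "c3", "d4"],
    "rated":  [True, False, True, True],
    "winner": ["white", "black", "black", "draw"],
})

# Keep only rated games, where the players' ratings are on the line.
rated_games = games[games["rated"]].reset_index(drop=True)
print(len(rated_games))  # 3 of the 4 toy games are rated
```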


Data Processing

The following block of code loops through the rated dataset and removes games that we consider to be "bad games". We treat the following situations as "bad games" and remove them from our cleaned dataset:

  1. One or more queens are not moved during the length of the game.
  2. A pawn is promoted to a queen before both queens are moved.

Here is some information about what it means to promote a pawn: https://www.chess.com/terms/pawn-promotion.

We plan to examine whether or not moving the queen early in the game decreases the chance of winning. We want to examine a particular kind of game; situations where a queen never gets moved, or where a pawn is promoted to a queen first, are abnormal, skew the data, and fall outside the focus of this examination.

To determine if a queen has moved we look at the string of chess moves, turn it into an array, and check the first character of each move to see if it is a "Q". If we find a "Q" we know it is a move by the queen. To find a pawn promotion we look for "=Q", which is the chess notation for a promotion to a queen. Since only a pawn can promote to a queen, this notation makes promotions easy to find. We can use the fact that white always moves first to determine whether it was the black or white queen that moved, or whether a black or white pawn was promoted to a queen.
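The parsing described above can be sketched as follows; `first_queen_move`, `has_early_promotion`, and `is_bad_game` are hypothetical helper names, not the notebook's actual functions:

```python
def first_queen_move(moves_str):
    """Return (white_idx, black_idx): the half-move index at which each
    queen first moves, or None if that queen never moves.  White's moves
    sit at even indices and black's at odd, since white always moves first."""
    white_idx, black_idx = None, None
    for i, move in enumerate(moves_str.split()):
        if move.startswith("Q"):               # queen move, e.g. "Qh5", "Qxf7#"
            if i % 2 == 0 and white_idx is None:
                white_idx = i
            elif i % 2 == 1 and black_idx is None:
                black_idx = i
    return white_idx, black_idx

def has_early_promotion(moves_str):
    """True if a pawn promotes to a queen ("=Q") before both queens moved."""
    white_idx, black_idx = first_queen_move(moves_str)
    for i, move in enumerate(moves_str.split()):
        if "=Q" in move and (
            white_idx is None or black_idx is None or i < max(white_idx, black_idx)
        ):
            return True
    return False

def is_bad_game(moves_str):
    """A game is 'bad' if either queen never moves or a promotion comes too early."""
    white_idx, black_idx = first_queen_move(moves_str)
    return white_idx is None or black_idx is None or has_early_promotion(moves_str)
```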

The first thing we will look into is the effect opening has on white and black's chances of winning. Openings are a way of describing the first phase of a chess game, classified by the order of the moves a player played. Here is a website where you can find more information about openings: https://chessfox.com/chess-openings-list/.

We want to sum the number of wins and ties for each of the colors (white and black). We will compile these numbers into a dictionary where all of the openings will correspond with [wins for black | wins for white | ties]. This begins to give us perspective on how the color and opening can impact a win. This can also show us if some openings are more favorable to one color or the other.
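A sketch of that tally on a few toy rows (the real loop runs over the cleaned dataset; `opening_results` is a hypothetical name):

```python
from collections import defaultdict

import pandas as pd

# Toy cleaned dataset; column names follow the Kaggle CSV.
cleaned = pd.DataFrame({
    "opening_name": ["Sicilian Defense", "Sicilian Defense", "Queen's Gambit"],
    "winner": ["black", "white", "draw"],
})

# opening name -> [wins for black, wins for white, ties]
opening_results = defaultdict(lambda: [0, 0, 0])
for _, game in cleaned.iterrows():
    if game["winner"] == "black":
        opening_results[game["opening_name"]][0] += 1
    elif game["winner"] == "white":
        opening_results[game["opening_name"]][1] += 1
    else:  # "draw"
        opening_results[game["opening_name"]][2] += 1

print(dict(opening_results))
```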

The function "top_n" returns the n most frequently used openings, sorted by frequency. This helps us identify openings that are played very often, giving us a larger sample size of games for each opening; for example, we can use it to grab the 50 most used openings and ensure we are looking at well-tested ones.

This code snippet performs an analysis on the openings of the chess games. It identifies the top 50 most frequently used openings, sorts them by frequency, and creates a DataFrame to store the analyzed data. The DataFrame includes information such as the opening name, ECO code, number of wins for white and black players, number of draws, and the frequency of occurrence. By focusing on the top 50 most used openings, the code helps identify well-tested openings with a higher likelihood of success or failure.
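A minimal version of `top_n` and the resulting DataFrame might look like this (toy counts; the real table also carries the ECO code for each opening):

```python
import pandas as pd

# opening name -> [black wins, white wins, ties], as tallied earlier; toy values here.
opening_results = {
    "Sicilian Defense": [30, 25, 5],
    "Queen's Gambit":   [10, 20, 2],
    "King's Pawn Game": [4, 3, 1],
}

def top_n(results, n):
    """Return the n openings with the most recorded games, most frequent first."""
    return sorted(results.items(), key=lambda kv: sum(kv[1]), reverse=True)[:n]

openings_df = pd.DataFrame(
    [{"opening_name": name,
      "black_wins": counts[0],
      "white_wins": counts[1],
      "draws": counts[2],
      "frequency": sum(counts)}
     for name, counts in top_n(opening_results, 2)]
)
print(openings_df)
```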

Exploratory Analysis & Data Visualization

Most Common (and Successful) Openings

The following code is an analysis of win percentages for the top 50 most used openings and which piece color wins them the most. We decided to look at the 50 most played openings, rather than the highest win rates for each color, because many opening variations were used only once and won; those 100% win-rate openings significantly skewed the results and were not meaningful for what we wanted to display. Instead, we used only the most frequently played openings, sorted by how often they were used.

Results: When playing as white, the highest percentage of wins for the top 50 most commonly played openings came from the Queen's Gambit Refused with a win rate of about 70%. When playing as black, the highest percentage of wins for the top 50 most commonly played openings came from the Sicilian Defense: Alapin Variation with a win rate of about 67%. Of the 50 most played openings, the French Defense: Exchange Variation led to the highest percentage of draws at about 10%.

It should be noted that some of the top 50 most commonly used openings had zero draws so they are still listed but have no bar.


Separating games by ECO

Every chess opening variation is part of a larger family of openings. Each family of openings is classified using an Opening ECO code. Opening ECO, which stands for Encyclopaedia of Chess Openings, represents a specific code for specific openings. Each one of these codes represents a way of classifying groups of openings. For example, openings with the code A00 are all variations of the Polish (Sokolsky) Opening. For more on opening ECO codes, you can visit this link: https://www.365chess.com/eco.php.

This code block is a simple function that grabs the top n number of openings based on how many times they are used.

There are too many variations of each ECO family so we decided to only grab the top 50 most common Opening ECOs. We then figure out how many wins each color has using that Opening ECO as well as how many draws there were.

After splitting up the data to find the 50 most commonly used Opening ECOs, we graph them to see the win percentages for white and black, as well as how often these families of openings ended in a draw.

Results: For white the best opening ECOs are D06 (Queen's Gambit) and C24 (Bishop's Opening Berlin Defense). For black the best opening ECOs are D10 (Queen's Gambit Declined Slav Defense) and B22 (Sicilian Defense, Nimzowitsch Variation). For games where there was a draw, the most common openings where there were draws were B23 (Sicilian Defense, closed) and B40 (Sicilian Defense).

Gathering Win Percentage Based on Number of Moves Before Moving The Queen

There is a common strategy in chess that says you should not move your queen out early in the game. The reasoning for this is because you may be more likely to lose your queen if you move it out too early. We want to see if this is true.

The following code loops through the cleaned dataset and counts how many games each color won at a certain number of moves before moving the queen. This is important to our model because we believe there is a trend between the number of moves a player makes before moving the queen and the percent of the time they win. With this code we build the foundation of collecting parameters for our machine learning model.
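A minimal sketch of this counting step, shown for the white pieces only and using toy (moves-before-queen, winner) pairs in place of the real parsed games:

```python
from collections import defaultdict

# Toy (moves_until_white_queen_moved, winner) pairs; in the notebook these
# come from parsing the move strings of the cleaned dataset.
games = [(3, "white"), (3, "black"), (8, "white"), (8, "white"), (8, "black")]

wins = defaultdict(int)
totals = defaultdict(int)
for moves_before_queen, winner in games:
    totals[moves_before_queen] += 1
    if winner == "white":
        wins[moves_before_queen] += 1

# Win percentage at each "moves before the queen moved" value.
win_pct = {m: 100 * wins[m] / totals[m] for m in totals}
print(win_pct)
```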

Number of moves until queen was moved vs. win percentage for the player using the white pieces

Here we are creating a scatter plot with a regression line in order to visualize the relationship between the number of moves until the queen is moved and the win percentage of a player who is using white pieces. This graph provides us insight into whether there is a correlation between these factors.
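The fit behind that regression line can be sketched with NumPy (toy points; the notebook overlays the fitted line on a matplotlib scatter plot):

```python
import numpy as np

# Toy data: x = moves until the white queen moved, y = white win percentage.
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([45.0, 48.0, 52.0, 55.0, 58.0])

# Least-squares line y = slope * x + intercept; a positive slope means
# waiting longer to move the queen goes with a higher win percentage.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```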

Results: We can see from the above scatter plot that there is a positive correlation between the number of moves until the queen was moved and the win percentage for players using the white pieces.

Number of moves until queen was moved vs. win percentage for the player using the black pieces

Here we are creating a scatter plot with a regression line in order to visualize the relationship between the number of moves until the queen is moved and the win percentage of a player who is using the black pieces. This graph provides us insight into whether there is a correlation between these factors.

Results: We can see from the graph that there is a positive correlation between the number of moves until the queen was moved and the win percentage for players. The increase in win percentage seems to be more dramatic for the black pieces.

So, in general, it seems as if the longer a player waits to move the queen, the greater that player's chance of winning becomes.


Residuals vs. Numbers of moves until queen was moved

Here we calculate the residuals by subtracting the predicted values from the actual values and create a scatter plot to visualize the relationship between the number of moves until the queen is moved (x-axis) and the residuals (y-axis). The residuals help us assess the deviation between the predicted and actual win percentages and provide insights into the accuracy of the regression model.
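The residual computation, sketched on toy data (the same actual-minus-predicted definition as above):

```python
import numpy as np

# Toy data: moves until the queen moved vs. win percentage.
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([45.0, 48.0, 52.0, 55.0, 58.0])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
residuals = y - predicted   # actual minus predicted win percentage

# A residual near 0 means the line predicts that point well; the notebook
# scatters the residuals against x to look for leftover patterns.
print(residuals)
```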

Results: As seen from these two graphs, the residuals are more closely centered around 0 for the black pieces than for the white pieces, but both graphs have many points that are far from zero. The residuals tend to be smaller when the queen is moved quickly, which suggests the model predicts the win percentages of early-queen games more accurately.


Model: Analysis, Hypothesis Testing, & Machine Learning

Ordinary Least Squares (OLS) Regression Analysis for Opening Name vs. White Win Percentage

We are curious as to how correlated the Opening Name is to white's win percentage. We decided to run an Ordinary Least Squares with the Opening Name being the independent variable and white's win percentage being the dependent variable. However, we first had to encode the opening names with a numerical value to properly fit the model, so each opening name got an index value.
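A sketch of the encode-then-regress step on toy data; the notebook itself uses statsmodels' OLS summary, but the encoding and the R-squared can be shown by hand:

```python
import numpy as np
import pandas as pd

# Toy per-opening data; the real table has one row per opening with its
# observed white win percentage.
df = pd.DataFrame({
    "opening_name": ["Sicilian", "Caro-Kann", "Italian", "London"],
    "white_win_pct": [52.0, 49.0, 51.0, 50.0],
})

# Encode each opening name as an integer index so it can enter the regression.
df["opening_code"] = pd.factorize(df["opening_name"])[0]

# Fit white_win_pct ~ opening_code by least squares and compute R-squared.
slope, intercept = np.polyfit(df["opening_code"], df["white_win_pct"], 1)
pred = slope * df["opening_code"] + intercept
ss_res = ((df["white_win_pct"] - pred) ** 2).sum()
ss_tot = ((df["white_win_pct"] - df["white_win_pct"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```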

Hypotheses

Results: As seen by the low R-squared values and high P-values, it seems that opening name does not correlate significantly with the percentage of wins. Therefore, we fail to reject the null hypothesis.

Ordinary Least Squares (OLS) Regression Analysis for Queen Movement Having Impact on Win Percentage

Here we are using an OLS regression model to examine the relationship between the movement of a queen and its effect on the win percentage.

Hypotheses

White Queen Regression Results


We then did the same thing for the black pieces by looking at the number of moves before the black queen was moved, and ran the same OLS regression.

Hypotheses

Black Queen Regression Results

As the scatter plots showed, there seems to be more of a correlation between black waiting to move their queen and winning than there is for white.


This code segment performs an F-test on the regression model to test the joint significance of all the coefficients in the model. It uses the identity matrix as the hypothesis matrix.

The F-test result will provide information about the overall significance of the regression model. The output will include the F-statistic value, the corresponding p-value, and other relevant statistics.

Results of F-test: As seen from both F-tests the p-values are very low, effectively zero, which suggests there is strong evidence against the null hypothesis and strong evidence for the alternative hypothesis.


ELO Rating & Bins

Now, we are curious about the effects of the opening that was played and ELO rating. An ELO rating is a numerical rating given to a chess player to dictate how strong of a chess player they are. For more information about ELO rating, visit https://www.hiarcs.com/hce-manual/pc/Eloratings.html.

We have made 10 bins of ELO ratings from 700 points to 2700 to capture games in separate bins. The players within each bin have similar skill levels.
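The binning can be sketched with pandas' `cut` (toy ratings; the labels such as "2501-2700" match the 200-point ranges described above):

```python
import pandas as pd

# Toy ratings; the real notebook assigns every game a bin by player rating.
ratings = pd.Series([820, 1450, 1980, 2650, 1200])

# Eleven edges from 700 to 2700 give ten 200-point-wide bins,
# e.g. (700, 900] labelled "701-900", ..., (2500, 2700] labelled "2501-2700".
edges = list(range(700, 2701, 200))
labels = [f"{lo + 1}-{lo + 200}" for lo in edges[:-1]]
binned = pd.cut(ratings, bins=edges, labels=labels)
print(binned.tolist())
```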

Results: Above are printed the top 5 openings for each ELO range. One notable trend is that the first 6 ranges, the lower rated players, all use similar openings, with little variation between them. The last several ranges, the more highly rated players, use different openings than the lower ranges. Furthermore, the openings used at the highest ranges include more variations and are more complex. This makes sense because higher rated players tend to have more experience and are able to play more complicated openings.

A Visualization of Openings Used By Each ELO Range

Previously, we saw the top 5 openings for each ELO range, but that didn't give the clearest picture of the overall opening usage for each range. Next, we visualize the openings used by each ELO range.

To best show proportions, we chose to visualize our data using pie charts. The following pie charts include all openings used by players in each ELO range.

Results: The most notable aspect of the pie charts above is the diversity of openings and how many there are. The lower rating ranges have bigger "pie slices" because fewer openings are played by those players, most likely because lower rated players use simpler openings that are easier to understand. The general trend is that the higher players are rated, the more different openings (i.e., more variations) they use. This trend seems to reverse at the highest rated ranges, likely because there are fewer games in our dataset played by players rated 2501-2700.

Win Percentages by ELO Rating Grouped by ELO Range

Next, we want to visualize the win percentages for each opening used by players in each ELO range. We are curious as to whether or not a bigger ELO disparity results in more wins for the player with a higher ELO rating. To do this, we use the dictionary elo_openings used to make the above pie charts and cross reference the openings from this dictionary with the master openings dictionary, which contains the win percentage data needed.

For the purpose of readability, only openings from the top 50 openings are included for ranges in which the number of openings played is greater than 100.

Results: These bar graphs show the most effective openings for both players in each ELO range. These are valuable insights for players who are looking to improve their game. No matter what ELO a player is at, these graphs show the best openings that they can use.

As we get to the highest ratings, we see that only six openings are being played. This is a result of having few high rated games that qualify for our cleaned dataset.


ELO Rating and Opening Name

We created a dataframe containing columns for: White's ELO, Black's ELO, Opening Name, Opening Win Rate for White, Opening Win Rate for Black, Draw Rate for the Opening, White Result, Black Result, Number of moves until the black queen was moved, and Number of moves until the white queen was moved. Each row of the dataframe contains the information for one game from our cleaned dataset. We will use this dataframe, which we call "ratings_and_openings", to examine the effect that a player's ELO, opening, and the number of moves until the queen was moved have on whether they win or not.

Win Rate By ELO Difference

Another feature we wanted to examine was the impact that ELO difference had on the outcome of a game. We wanted to see how often the underdog player beat the higher rated player. We analyzed this from the perspectives of the white player and the black player and visualized them in the bar graphs below.
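A sketch of the bucketed win-rate computation on a handful of toy games (column and bucket names are assumptions, not the notebook's):

```python
import pandas as pd

# Toy games: rating difference from white's perspective, and the winner.
games = pd.DataFrame({
    "elo_diff": [-350, -120, 40, 210, 260, 510],
    "winner":   ["black", "black", "white", "white", "black", "white"],
})

# Bucket the differences, then compute white's win rate inside each bucket.
games["diff_bucket"] = pd.cut(games["elo_diff"], bins=[-600, -200, 0, 200, 600])
white_win_rate = (
    (games["winner"] == "white").groupby(games["diff_bucket"], observed=True).mean()
)
print(white_win_rate)
```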

Results: When examining the following bar graphs, it is evident that whether someone is playing as black or white, if they have a higher ELO rating than their opponent, their win rate is greater than 50%. We found that when a player's ELO is about 900 points greater than their opponent's, they won every such game in our dataset. This makes sense because someone with an ELO 900 points higher than their opponent is likely to have more experience, to have played longer, and to understand more strategy. On the flip side, when someone has a lower ELO than their opponent, regardless of the color they play, their win rate drops significantly.


More Hypothesis Testing

We wanted to examine the statistics around ELO differences and win rates. After collecting the data needed to create an OLS model, we came up with the following null and alternative hypotheses:

Null Hypothesis: ELO difference does not have a statistically significant effect on win rate for either player.

Alternative Hypothesis: ELO difference does have a statistically significant effect on win rate for either player.

Results for white win rate

As seen above, the P-value is 0.000, which is below the alpha of 0.05, therefore we reject the null hypothesis. The R-squared is 0.955, which means that 95.5% of the variation in white win rate is explained by our independent variable, the ELO difference. Even more conservatively, the adjusted R-squared is 0.953 and the F-statistic is 403.5, which further shows how much of an impact the ELO difference has on white win rate.

Results for black win rate

As seen above, the P-value is 0.000, which is below the alpha of 0.05, therefore we reject the null hypothesis. The R-squared is 0.959, which means that 95.9% of the variation in black win rate is explained by our independent variable, the ELO difference. Even more conservatively, the adjusted R-squared is 0.957 and the F-statistic is 448.1, which further shows how much of an impact the ELO difference has on black win rate.

Comparing the two

When comparing the statistics for ELO difference on both the white and black win rates, it is clear that ELO difference plays a huge role in the outcome of games.


Building the Model

After figuring out what parameters seem to have a big impact based on the statistics we calculated, we will begin building our Machine Learning model. The goal of the model is to identify who will win the game given some information about the game. The information we give the model will be the features the model uses to predict the outcome.

For our first model, we will be using the difference in ELO Rating, the Win Rate for White of the specific opening, and how long it took to move the queen as our feature values. The goal of the model will be to predict whether white won or lost, which will be our y value.

Now, let's fit a decision tree model and make some predictions.

We will also fit a Logistic Regression model and test the accuracy of this one as well.
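Both fits can be sketched with scikit-learn on synthetic data (the real features come from the ratings_and_openings dataframe; the numbers below are made up to mimic their shapes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic feature matrix mimicking the real features:
# [ELO difference, opening win rate for white, moves until the white queen moved].
n = 400
X = np.column_stack([
    rng.normal(0, 300, n),      # ELO difference (white minus black)
    rng.uniform(0.3, 0.7, n),   # win rate for white of the opening played
    rng.integers(1, 20, n),     # moves before the white queen moved
])
# Synthetic label: white is more likely to win with a bigger ELO edge.
y = (X[:, 0] + rng.normal(0, 150, n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on held-out games for each model.
print(tree.score(X_test, y_test), logreg.score(X_test, y_test))
```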

Results: It seems that the logistic regression is about 7% more accurate than the decision tree for the white pieces. Although the logistic regression accuracy is greater than that of the decision tree, it still is not very good. While 65.2% is better than a 50/50 toss up, it's not great.

We will do the same for black. Using the Rating difference and Opening Win Rate for Black as our feature values, we will try to predict if black won. Similarly for white, we will fit a Decision Tree and a Logistic Regression.

Results: Similar to the white pieces it seems that the logistic regression is about 7% more accurate than the decision tree for the black pieces. The logistic regression accuracy is also only about 66% which is not phenomenal.

It seems that we are able to predict the outcome for the black pieces slightly better than for the white pieces. However, this is likely just a coincidence.


Improving the Models

Now we will try to improve the models. One way to improve a machine learning model is to change the features you use, in an attempt to find ones that correlate more strongly with the variable you are trying to predict.

We noticed that the Opening ECO code win rate may be more correlative with the winner compared to Opening name win rate. We suspect that this is the case because the ECO code represents a family of openings that share a similar move order, so it may be the overarching tactics of the opening that correlates more to winning rather than a specific move order of the game.

We will attempt to improve the model by using the Opening ECO win rate as one of our feature values along with the ELO Rating Difference and how long it took to move the queen. Like previously, we will fit our Decision Tree and Logistic Regression Models for both White and Black.

ELO Rating and ECO Code

We also want to be able to examine the effects of a player's ELO and the opening ECO they played on their chance of winning. We created a dataframe containing columns for: White's ELO, Black's ELO, Opening ECO, ECO Win Rate for White, ECO Win Rate for Black, Draw Rate for the ECO, White Result, Black Result, Number of moves until the black queen was moved, and Number of moves until the white queen was moved. Each row of the dataframe contains the information for one game from our cleaned dataset. We will use this dataframe, which we call "ratings_and_eco", to examine the effect that a player's ELO and opening ECO have on whether they win or not.

Opening ECO vs. White Win Percentage

Similarly to Opening Name, we are also curious whether Opening ECO has a strong correlation to white win percentage.

We ran Ordinary Least Squares regression with Opening ECO, encoded as a numerical value, as the independent variable and white win percentage as the dependent variable.

Results: Compared to the P-value of opening name versus win percentage, opening ECO code correlates much better with win percentage. The opening name OLS has a p-value of 0.802 and the opening ECO code shown above has a p-value of 0.354. Therefore, we believe that because of the stronger correlation with ECO it may improve our model.

Results: Using ECO instead of opening name, we see that the decision tree model was slightly more accurate going from 57% accuracy to 60% accuracy. However, the Logistic Regression model went down 1% point so this change may not tell us anything for white.

Results: For black, using the ECO as a feature value yields a slightly higher decision tree accuracy by about 1%. For Logistic Regression, using ECO was about 2% worse.

Insights and Conclusion

Although the statistics from the OLS results showed that opening ECO correlates better with win percentage than opening name does, the model did not prove to be much (if any) more accurate than the one built on opening name features. This was a bit shocking to us, as there was quite a big difference in the p-values from the two OLS results. Although the model was not significantly improved, these results are still important: they show that using a specific opening, versus any opening from the same family, does not have a huge impact on whether someone wins or not. Chess is all about being able to predict your opponent's next move, and ultimately that ability comes from experience. As someone plays more chess they will naturally learn more openings, techniques, and skills to beat their opponent. Just like most things in life, shortcuts don't always yield better results, and here we found the same: learning specific openings will not be the reason you win a game; rather, having a higher ELO than your opponent, which comes from playing more, will be.

After examining all of the openings, the ELO difference patterns, and how long someone should wait to move their queen, Ben and Ming each now have more techniques to improve their games. From the models we have created, it looks like Ming and Ben's best chance of winning is to play a lot to increase their ELO and to learn the openings used by highly ranked players. This data is not just useful for Ben and Ming: any beginner chess player with some knowledge of statistics and data science can use the results above to become a stronger player.